<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>Prex Kernel Internals</title>
  <meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
  <meta name="keywords" content="Prex, embedded, real-time, operating system, RTOS, open source, free">
  <meta name="author" content="Kohsuke Ohtani">
  <link rel="stylesheet" type="text/css" href="../default.css" media="screen">
  <link rel="stylesheet" type="text/css" href="../print.css" media="print">
</head>
<body>
<div id="top">
</div>
<div id="middle">

<table id="content" cellpadding="0" cellspacing="0">
  <tbody>

    <tr>
      <td id="header" colspan="2" valign="top">
        <table width="100%" border="0" cellspacing="0" cellpadding="0">
        <tr>
          <td id="logo">
            <a href="http://prex.sourceforge.net/">
            <img alt="Prex logo" src="../img/logo.gif" border="0"
            style="width: 250px; height: 54px;"></a>
          </td>
          <td id="brief" align="right" valign="bottom">
            An Open Source, Royalty-free,<br>
	    Real-time Operating System
          </td>
        </tr>
        </table>
      </td>
    </tr>

    <tr>
      <td id="directory" style="vertical-align: top;">
      <a href="http://prex.sourceforge.net/">Prex Home</a> >
      <a href="index.html">Document Index</a> >
      Kernel Internals
    </tr>
    <tr><td class="pad" colspan="2" style="vertical-align: top;"></td></tr>

    <tr>
      <td id="doc" style="vertical-align: top;">

      <h1>Prex Kernel Internals</h1>

<i>Version 1.6, 2008/09/17</i>

<h3>Table of Contents</h3>
<ul>
  <li><a href="#intro">Introduction</a></li>
  <li><a href="#design">Design Philosophy</a></li>
  <li><a href="#over">Kernel Overview</a></li>
  <li><a href="#thread">Thread</a></li>
  <li><a href="#task">Task</a></li>
  <li><a href="#sched">Scheduler</a></li>
  <li><a href="#memory">Memory Management</a></li>
  <li><a href="#ipc">IPC</a></li>
  <li><a href="#except">Exception Handling</a></li>
  <li><a href="#int">Interrupt Framework</a></li>
  <li><a href="#timer">Timer</a></li>
  <li><a href="#device">Device I/O Service</a></li>
  <li><a href="#mutex">Mutex</a></li>
  <li><a href="#debug">Debug</a></li>
  <li><a href="#arch">Architecture Dependent Layer</a></li>
</ul>


<h2 id="intro">Introduction</h2>

<p>
This document describes the design and implementation of the Prex kernel.
</p>
<p>
For a full description of the Prex kernel interface,
see the following documents.
</p>
<ul>
  <li><a href="kapi.html">Prex Kernel API reference</a></li>
  <li><a href="arch.html">Architecture Dependent Interface</a></li>
</ul>



<h2 id="design">Design Philosophy</h2>
<p>
The Prex kernel focuses the following points to be designed.
</p>
<ul>
  <li>Portability</li>
  <li>Scalability</li>
  <li>Reliability</li>
  <li>Interoperability</li>
  <li>Maintainability</li>
</ul>

<h3>Portability</h3>
<p>
Portability is the most important point for the kernel design in Prex.
The Prex kernel is divided into two different layers -
a common kernel layer and an architecture dependent layer.

Any routine in the common kernel layer must not access to the H/W by itself.
Instead, it must use the H/W access services provided by
the architecture dependent layer.
</p>

<h3>Scalability</h3>
<p>
In order to obtain higher scalability, the kernel does not limit the maximum
number of the kernel objects to create. So, the resources for all kernel
objects are allocated dynamically after system boot.
This can keep the memory prerequisite smaller than the static resource
allocation.
This means that the kernel can create any numbers of threads, objects, devices,
events, mutexes and timers as far as usable memory remains.
</p>
<p>
The kernel supports both of MMU and MMU-less systems. So, most components
of the kernel are designed carefully to work without MMU.
</p>

<h3>Reliability</h3>
<p>
When the remaining memory is exhausted, what should OS do?
If the system can stop with panic() there, we can prevent many error checks
in the kernel.
But obviously, this is not allowed on the reliable system.
Even if the memory is exhausted, a kernel must continue to do its jobs.
So, all kernel codes are always checking the error status returned
by the memory allocation routine.
</p>
<p>
In addition, the kernel must not crush anytime even if any invalid parameter
is passed via kernel API. Basically, the Prex kernel code is written with
"garbage in, error out" principle.
The Prex kernel never stops even if any malicious program is loaded.
</p>

<h3>Interoperability</h3>
<p>
Although the Prex kernel was written from scratch, its applications
will be brought from the other operating systems like BSD. So, the system
call interface is designed with consideration to support generic OS
API like POSIX or APIs for generic RTOS.
</p>
<p>
The error code for the Prex system call is defined as the same name
with POSIX. For example, EINVAL for "Invalid argument", or ENOMEM for
"Out of memory". So, peoples do not have to study new error codes if
they already have skills about POSIX programming.
This is important point to write applications and to read
the kernel code because study of a new error scheme will cause pain
for developers.

In addition, it simplifies the POSIX emulation library because it
does not have to remap the error code.
</p>

<h3>Maintainability</h3>
<p>
All kernel codes are kept clean and simple for the maintenance.
All codes are well-commented and consistent.
It is easy to add or remove new system call into the kernel.
The kernel has the debugging facility like the diagnostic message or
the dump of the kernel objects.
</p>



<h2 id="over">Kernel Overview</h2>

<h3>Kernel Structure</h3>
<p>
The following figure illustrates the Prex kernel structure.
</p>
<p>
<img alt="Kernel Components" src="img/kernel.gif" border="1"
style="width: 602px; height: 414px;"><br>

<i><b>Figure 1. Prex kernel Structure</b></i>
</p>
<p>
Each kernel object belongs in one of the following groups.
</p>
<ul>
  <li><b>kern</b>: kernel core components</li>
  <li><b>mem</b>: memory managers</li>
  <li><b>ipc</b>: inter process communication (*)</li>
  <li><b>sync</b>: synchronize objects</li>
  <li><b>arch</b>: architecture dependent components</li>
</ul>
<p>
<i>*) Since all messages in Prex are transferred among threads, the name of
"IPC" is not appropriate.
However, "IPC" is still used as a general term of the message transfer
via the kernel, in Prex.</i>
</p>

<h3>Naming Convention</h3>
<p>
The name of "group/object" in figure 1 is mapped to "directory/file" in the
Prex source tree. For example, the thread related functions are located
in "kern/thread.c", and the functions for semaphore are placed
in "sync/sem.c".
</p>
<p>
In addition, there is a standard naming convention about kernel
routines. The method named <i>bar</i> for the object named <i>foo</i>
should be named "foo_bar". For example, the routine to create a new
thread is named "thread_create", and locking mutex will be "mutex_lock".
This rule is not applied to the name for the local (private) functions.
</p>


<h3>Kernel Service</h3>

<p>
The Prex microkernel has kernel calls to support the following:
</p>

<ul>
  <li>threads and tasks</li>
  <li>message passing</li>
  <li>timers</li>
  <li>exceptions</li>
  <li>device i/o</li>
  <li>mutexes</li>
  <li>semaphores</li>
  <li>condition variables</li>
  <li>debug and log service</li>
</ul>


<h2 id="thread">Thread</h2>

<h3>Thread Control Block</h3>
<p>
The thread control block includes data for
owner task, scheduler, timer, IPC, exception, mutex, and context.
The following thread structure is most important definition
in the kernel codes.
</p>
<pre>
struct thread {
        int             magic;          /* magic number */
        task_t          task;           /* pointer to owner task */
        struct list     task_link;      /* link for threads in same task */
        struct queue    link;           /* linkage on scheduling queue */
        int             state;          /* thread state */
        int             policy;         /* scheduling policy */
        int             prio;           /* current priority */
        int             baseprio;       /* base priority */
        int             timeleft;       /* remaining ticks to run */
        u_int           time;           /* total running time */
        int             resched;        /* true if rescheduling is needed */
        int             locks;          /* schedule lock counter */
        int             suscnt;         /* suspend counter */
        struct event    *slpevt;        /* sleep event */
        int             slpret;         /* sleep result code */
        struct timer    timeout;        /* thread timer */
        struct timer    *periodic;      /* pointer to periodic timer */
        uint32_t        excbits;        /* bitmap of pending exceptions */
        struct queue    ipc_link;       /* linkage on IPC queue */
        void            *msgaddr;       /* kernel address of IPC message */
        size_t          msgsize;        /* size of IPC message */
        thread_t        sender;         /* thread that sends IPC message */
        thread_t        receiver;       /* thread that receives IPC message */
        object_t        sendobj;        /* IPC object sending to */
        object_t        recvobj;        /* IPC object receiving from */
        struct list     mutexes;        /* mutexes locked by this thread */
        struct mutex    *wait_mutex;    /* mutex pointer currently waiting */
        void            *kstack;        /* base address of kernel stack */
        struct context  ctx;            /* machine specific context */
};</pre>

<h3>Thread Creation</h3>
A thread can be created by thread_create().
The initial states of newly created thread are as follows:
<p>
</p>

<i><b>Table 1. Initial thread state</b></i>
<table border="1" width="60%" cellspacing="0">
<tbody>
<tr>
  <th>Data type</th>
  <th>Initial state</th>
</tr>
<tr>
  <td>Owner Task</td>
  <td>Inherit from parent thread</td>
</tr>
<tr>
  <td>Thread state</td>
  <td>Suspended</td>
</tr>
<tr>
  <td>Suspend count</td>
  <td>Task suspend count + 1</td>
</tr>
<tr>
  <td>Scheduling policy</td>
  <td>Round Robin</td>
</tr>
<tr>
  <td>Scheduling Priority</td>
  <td>Default (= 200)</td>
</tr>
<tr>
  <td>Time quantum</td>
  <td>Default (= 50 msec)</td>
</tr>

<tr>
  <td>Processor registers</td>
  <td>Default value</td>
</tr>

</tbody>
</table>

<p>
Since new thread is initially set to the suspended state, thread_resume()
must be called to start it.
</p>

Creating a thread and loading its register state are isolated
in different routines. These two routines are used by fork(), exec(),
and pthread_create() in the POSIX emulation library.
</p>

<i><b>Table 2. Usage of thread_create()/thread_load()</b></i>
<table border="1" width="80%" cellspacing="0">
<tbody>
<tr>
  <th>Library routine</th>
  <th>thread_create()</th>
  <th>thread_load()</th>
</tr>
<tr>
  <td>fork()</td>
  <td align="center">O</td>
  <td align="center">X</td>
</tr>
<tr>
  <td>exec()</td>
  <td align="center">X</td>
  <td align="center">O</td>
</tr>
<tr>
  <td>pthread_create()</td>
  <td align="center">O</td>
  <td align="center">O</td>
</tr>
</tbody>
</table>

<p>
The address for the stack pointer is also set by thread_load().
Since the Prex kernel does not allocate any stack buffer for user mode threads,
the parent thread has responsible to allocate it.
</p>


<h3>Thread Termination</h3>
<p>
The kernel will usually release all resources owned by the terminated thread.
But, there are some complicated processes to release the resources.
For example, the priority adjustment may be required if the thread inherits its
priority.
</p>
<p>
If the thread is terminated with mutex locked, all threads waiting for
that mutex will sleep forever. So, the mutex held by the terminated thread
must be unlocked, or change its mutex owner if some thread is waiting for.

</p>
<p>
In general, there is a known issue about the thread termination.
If the termination target is the current thread, the kernel can not release
the context of the current thread because the
thread switching always requires current context.
There are the following 3 solutions for this.
</p>
<ul>
 <li>1) Create "clean up thread" to terminate a thread</li>
 <li>2) Add condition check in thread switching code</li>
 <li>3) Defer termination until next termination request</li>
</ul>
The Prex kernel is using #3.

<h3>Thread Suspension</h3>
<p>
Each thread can be set to the suspended state by using thread_suspend().
Although a thread can be suspended any number of times,
it does not start to run unless it is resumed by the same
number of suspend.
</p>

<h3>Kernel Thread</h3>
<p>
A kernel thread is always executed in kernel mode, and it does not have user
mode context.
The scheduling policy is set to SCHED_FIFO by default.
</p>
<p>
Currently, the following kernel threads are running in kernel mode.
</p>
<ul>
  <li>Interrupt Service Threads</li>
  <li>Timer Thread</li>
  <li>Idle Thread</li>
  <li>DPC Thread</li>
</ul>

<h3>Idle Thread</h3>
<p>
An idle thread runs when no other thread is active.
It has the role of cutting down the power consumption of a system.
An idle thread has FIFO scheduling policy, and it does not
have time quantum.
The lowest scheduling priority (=255) is reserved for an idle thread.
</p>
<p>
An idle thread is just a forever-loop to call the machine
dependent routine to cut power. The following thread_idle() routine
is called at the end of the kernel initialization.
</p>
<pre>
void
thread_idle(void)
{

        for (;;) {
                machine_idle();
                sched_yield();
        }
}
</pre>
<p>
The machine_idle() routine will program the platform H/W to
the low power mode. This is typically
invoking the power saving (halt) instruction
supported by the processor.
If any interrupts are occurred in this low power mode, it must be returned
immediately from machine_idle(). Then, the idle thread will
call sched_yield() to check the re-scheduling.
</p>


<h2 id="task">Task</h2>

<h3>Task Creation</h3>
<p>
The task can be created by using task_create().
New child task will have the same memory image with the parent task.
Especially text region and read-only region are physically
shared among them.
The parent task receives the new task ID of child task from task_create(), but
child task will receive 0 as task ID.
</p>
<p>
The initial task states are as follows:
</p>

<i><b>Table 3. Initial task state</b></i>
<table border="1" width="60%" cellspacing="0">
<tbody>
<tr>
  <th>Data type</th>
  <th>Inherit from parent task?</th>
</tr>

<tr>
  <td>Object List</td>
  <td align="center">No</td>
</tr>

<tr>
  <td>Threads</td>
  <td align="center">No</td>
</tr>

<tr>
  <td>Memory Map</td>
  <td align="center">Yes</td>
</tr>

<tr>
  <td>Suspend Count</td>
  <td align="center">No</td>
</tr>

<tr>
  <td>Exception Handler</td>
  <td align="center">Yes</td>
</tr>

</tbody>
</table>

<p>
If the parent task is specified as NULL for task_create(),
all child state are initialized to default.
This is used in exec() emulation.
</p>

<h3>Task Suspension</h3>
<p>
When the task is set to suspend state, the thread suspend count of all threads
in the task is also incremented.
A thread can start to run only when both of the thread suspend count
and the task suspend count becomes 0.
</p>

<h3>Kernel Task</h3>
<p>
A kernel task is the special task that has only an idle thread
and the interrupt threads. It does not have any user mode memory.
</p>

<h3>Task Capability</h3>
<p>
Prex supports a security framework named "task capability".
Each task will be assigned its own capabilities for various operations.
When a task tries to do a privileged operation, the
kernel, device drivers and system servers
will check the appropriate bit in the task capability.
</p>

<i><b>Table 4. Task capabilities</b></i>
<table border="1" width="80%" cellspacing="0">
<tbody>
<tr>
  <th>Capability</th>
  <th>Description</th>
</tr>

<tr>
  <td>CAP_SETCAP</td>
  <td>Allow setting capability.</td>
</tr>

<tr>
  <td>CAP_TASK</td>
  <td>Allow controlling another task's execution.</td>
</tr>

<tr>
  <td>CAP_MEMORY</td>
  <td>Allow touching another task's memory.</td>
</tr>

<tr>
  <td>CAP_KILL</td>
  <td>Allow raising exception to another task.</td>
</tr>

<tr>
  <td>CAP_SEMAPHORE</td>
  <td>Allow accessing another task's semaphore.</td>
</tr>

<tr>
  <td>CAP_NICE</td>
  <td>Allow changing scheduling parameter.</td>
</tr>

<tr>
  <td>CAP_IPC</td>
  <td>Allow accessing another task's IPC object.</td>
</tr>

<tr>
  <td>CAP_DEVIO</td>
  <td>Allow device I/O operation</td>
</tr>

<tr>
  <td>CAP_POWER</td>
  <td>Allow power control including shutdown.</td>
</tr>

<tr>
  <td>CAP_TIME</td>
  <td>Allow setting system time.</td>
</tr>

<tr>
  <td>CAP_RAWIO</td>
  <td>Allow direct I/O access.</td>
</tr>

<tr>
  <td>CAP_DEBUG</td>
  <td>Allow debugging requests.</td>
</tr>

<tr>
  <td>CAP_ADMNIN</td>
  <td>Allow mount,umount,sethostname,setdomainname,etc.</td>
</tr>

<tr>
  <td>CAP_EXEC</td>
  <td>Allow executing any file.</td>
</tr>

<tr>
  <td>CAP_FS_READ</td>
  <td>Allow reading any file.</td>
</tr>

<tr>
  <td>CAP_FS_WRITE</td>
  <td>Allow writing any file.</td>
</tr>

</tbody>
</table>

<h2 id="sched">Scheduler</h2>
<h3>Thread Priority</h3>
<p>
The Prex scheduler is based on the algorithm known as priority based
multi level queue. Each thread is assigned the priority between
0 and 255. The lower number means higher priority like BSD UNIX.
It maintains 256 level run queues mapped to each priority.
The lowest priority (=255) is used only for an idle thread.
</p>
<p>
A thread has two different types of priority:
</p>
<ul>
  <li><b>Base priority:</b>
  This is a static priority which can be changed only by user mode program.
  <li><b>Current Priority:</b>
  An actual scheduling priority.
  A kernel may adjust this priority dynamically if it's needed.
</ul>
<p>
Although the base priority and the current priority are same value in almost
conditions,
kernel will sometimes change the current priority to avoid
"priority inversion".
</p>

<p>
The following table shows the priority class for various thread types.
</p>


<i><b>Table 5. Thread Priorities</b></i>
<table border="1" width="50%" cellspacing="0">
<tbody>
<tr>
  <th>Class</th>
  <th>Priority</th>
  <th>Description</th>
</tr>
<tr>
  <td>Critical</td>
  <td>0<br>-<br>14</td>
  <td>High priority tasks</td>
</tr>
<tr>
  <td>Timer</td>
  <td>15</td>
  <td>Timer thread</td>
</tr>
<tr>
  <td>IST</td>
  <td>16<br>-<br>32</td>
  <td>Interrupt threads</td>
</tr>
<tr>
  <td>DPC</td>
  <td>33</td>
  <td>DPC thread</td>
</tr>
<tr>
  <td>Real-time</td>
  <td>34<br>-<br>126</td>
  <td>Realtime Tasks</td>
</tr>
<tr>
  <td>Normal</td>
  <td>127<br>-<br>254</td>
  <td>Normal tasks</td>
</tr>
<tr>
  <td>Idle</td>
  <td>255</td>
  <td>Idle thread</td>
</tr>

</tbody>
</table>






<h3>Thread State</h3>
<p>
Each thread has one of the following states.
</p>

<img alt="Memory Structure" src="img/thread.gif" border="1"
style="width: 430px; height: 314px;"><br>
<i><b>Figure 2. Thread States</b></i>

<ul>
  <li><b>RUN</b>     :Running or ready to run</li>
  <li><b>SLEEP</b>   :Sleep for some event</li>
  <li><b>SUSPEND</b> :Suspend count is not 0</li>
  <li><b>EXIT</b>    :Terminated</li>
</ul>




<p>
A thread is always preemptive even in kernel mode.
There are following 4 events to switch thread:
</p>

<i><b>Table 6. Events to switch thread</b></i>
<table border="1" width="90%" cellspacing="0">
<tbody>
<tr>
  <th>Event</th>
  <th>Condition</th>
  <th>Run queue position</th>
</tr>
<tr>
  <td><b>Block</b></td>
  <td>Thread sleep or suspend</td>
  <td>Move to the tail of runq</td>
</tr>
<tr>
  <td><b>Preemption</b></td>
  <td>Higher priority thread becomes runnable</td>
  <td>Keep the head of runq</td>
</tr>
<tr>
  <td><b>Quantum Expiration</b></td>
  <td>The thread consumes its time quantum</td>
  <td>Move to the tail of runq</td>
</tr>
<tr>
  <td><b>Yield</b></td>
  <td>The thread releases CPU by itself</td>
  <td>Move to the tail of runq</td>
</tr>

</tbody>
</table>


<h3>Scheduling Policy</h3>
<p>
There are following three types of scheduling policy.
</p>
<ul>
  <li><b>SCHED_FIFO</b>: First-in First-out
  <li><b>SCHED_RR</b>: Round Robin (SCHED_FIFO + timeslice)
  <li><b>SCHED_OTHER</b>: Not supported
</ul>
<p>
In early Prex development phase, SCHED_OTHER was implemented as a traditional
BSD scheduler. Since this scheduler changes the thread priority dynamically,
it is unpredictable and does not fit the real-time system.
Recently, SCHED_OTHER policy was dropped from Prex to
focus on real-time platform.
</p>

<h3>Scheduling Parameter</h3>
<p>
An application program can change the following scheduling parameters via
kernel API.
</p>
<ul>
  <li>Thread Priority</li>
  <li>Scheduling Policy</li>
</ul>

<h3>Scheduling Lock</h3>
<p>
Thread scheduling can be disabled by locking the scheduler.
This is used to synchronize thread execution to protect
accessing to the global resources.
Since an interrupt handler can run while scheduling lock state,
it does not affect to the interrupt latency.
</p>
<p>
The scheduling lock can be performed by sched_lock()/sched_unlock()
kernek functions.
</p>



<h3>Sleep & Wakeup</h3>
<p>
If a thread must wait some events, it should enter sleep state and
release CPU resource for other thread.
The thread can sleep by calling the sched_tsleep() kernel function.
And then, to wakeup the sleeping thread, the thread can use one of the
sched_wakeup() and sched_wakeone().
</p>

The kernel event consists of the queue of the sleeping threads.
Since each event has its own name,
it is easy to know which event the debugee is waiting for.
</p>

<p>
<img alt="Kernel Event" src="img/event.gif" border="1"
style="width: 332px; height: 305px;"><br>

<i><b>Figure 3. Kernel Event</b></i>
</p>



<h3>DPC</h3>
<p>
DPC (Deferred Procedure Call) is used to call the specific
function at some later time with a DPC priority. It is also
known as AST or SoftIRQ in other kernels.  DPC is typically
used by device drivers to do the low-priority jobs without
degrading real-time performance.
</p>
<p>
Each DPC routine is called by the DPC thread which works as a kernel thread.
The requested DPC request is inserted into the DPC queue, and that request
is processed by DPC thread at later time.
All interrupts are enabled and the scheduler is unlocked
when the DPC routine is called.
</ul>
</p>




<h2 id="memory">Memory Management</h2>

<h3>Physical Page Allocator</h3>
<p>The physical page allocator provides the service for
page allocation/deallocation/reservation.
It works as a bottom layer for other memory managers.
</p>
<p>
<img alt="Memory Structure" src="img/memory.gif" border="1"
style="width: 448px; height: 308px;"><br>

<i><b>Figure 4. Prex Memory Structure</b></i>
</p>
<p>
The important point here is that the Prex kernel does not swap out
any pages to the disk devices. This is a significant design
policy to obtain real-time performance and system simplicity.

</p>

<h3>Kernel Memory Allocator</h3>
<p>
The kernel memory allocator is optimized for the small
memory foot print system.
</p>
<p>
To allocate kernel memory, it is necessary to divide one page into
two or more blocks.
There are following 3 linked lists to manage used/free blocks for kernel
memory.
<ol>
  <li>All pages allocated for the kernel memory are linked.</li>
  <li>All blocks divided in the same page are linked.</li>
  <li>All free blocks of the same size are linked.</li>
</ol>
<p>
Currently, it can not handle the memory size exceeding one page.
Instead, a driver can use page_alloc() to allocate large memory.
<br>
When the kernel code illegally writes data into non-allocated memory,
the system will crash easily. The kmem modules are called from
not only kernel code but from various drivers. In order to
check the memory over run, each free block has a tag with magic ID.
</p>
<p>
The kernel maintains the array of the block headers for the free blocks.
The index of an array is decided by the size of each block.
All block has the size of the multiple of 16.
</p>
<pre>free_blks[0] = list for 16 byte block
free_blks[1] = list for 32 byte block
free_blks[2] = list for 48 byte block
     .
     .
free_blks[255] = list for 4096 byte block
</pre>
<p>
In general, only one list is used to search the free block
for a first fit algorithm.
However, the Prex kernel memory allocator is using multiple lists
corresponding to each block size.
A search is started from the list of the requested size. So,
it is not necessary to search smaller block's list wastefully.
</p>
<p>
In most of the "buddy" based memory allocators, their algorithm are
using <b>2^n</b> bytes as block size.
But, this logic will throw away much memory in case
the block size is not fit. So, this is not suitable for the
embedded systems that Prex aims to.
</p>

<h3>Virtual Memory Manager</h3>
<p>
A task owns its private virtual address space. All threads
in a same task share one memory space.
When new task is made, the address map of the parent task will be
automatically copied.
In this time, the read-only space is not copied and is shared with old map.
</p>
<p>
A kernel provides the following functions for VM:
</p>
<ul>
  <li>Allocate/deallocate memory region</li>
  <li>Change memory attribute (read/write/exec)</li>
  <li>Map another task's memory to current task</li>
</ul>
<p>
The VM allocator is using the traditional list-based algorithm.
</p>
<p>
The kernel task is a special task which has the virtual memory mapping
for kernel. All other user mode tasks will have the same kernel memory
image mapped from the kernel task. So, kernel threads can work with the
all user mode task context without switching memory map.
</p>
<p>
<img alt="Memory Mapping" src="img/memmap.gif" border="1"
style="width: 504px; height: 271px;"><br>

<i><b>Figure 5. Kernel Memory Mapping</b></i>
</p>

<p>
Since the Prex kernel does not do page out to an external storage,
it is guaranteed that the allocated memory is always continuing
and existing. Thereby, a kernel and drivers can be constructed
very simply.
</p>
<p>
<i>Note: "Copy-on-write" feature was supported with the Prex kernel before.
But, it was dropped to increase the real-time performance.</i>
</p>


<h2 id="ipc">IPC</h2>
<p>
The message passing model of Prex is very simple compared with other
modern microkernels. The Prex message is sent to the "object" from thread
to thread.
The "object" is similar concept that is called as "port"
in other microkernel.
</p>

<h3>Object</h3>
<p>
An object represents service, state, or policies etc.
For the purpose of object manipulation, the kernel provide 3 basic functions:
object_create(), object_delete(), object_lookup().

The Prex task will create an object to publish its interface to other tasks.
For example, server tasks will create objects like "proc", "fs", "exec" to
allow clients to access their services.
And then, client tasks will send a request message to these objects.
</p>
<p>
An actual entity of the object is stored in kernel space, and it is protected
from user mode code.
The object data is managed with the hash table by using its name string.
Usually, an object has a unique name within a system. To send a
message to the specific object, the sender must obtain the ID of the target object
by using object_lookup().
</p>
<p>
An object can be created without its name. These objects can be used as
private objects for threads in the same task.
</p>

<h3>Message</h3>
<p>
Each IPC message must include the message header in it.
The kernel will automatically store the sender task's ID into the message header.
This mechanism ensures the receiver task can get the exact task ID
of the sender task. Therefore, receiver task can check the sender
task's capability for various secure services.
</p>

<pre>
struct msg_header {
        task_t  task;           /* id of send task */
        int     code;           /* message code */
        int     status;         /* return status */
};
</pre>

<p>
It is necessary to recognize the pre-defined message format between
sender and receiver.
</p>
<p>
Messages are sent to the specific object using msg_send().
The transmission of a message is always synchronous. This means that
the thread which sent the message is blocked until it receives
a response from another thread. msg_receive() performs reception
of a message. msg_receive() is also blocked when no message is
reached to the target object. The receiver thread must answer the
message using msg_reply() after it finishes processing.
</p>
<p>
The receiver thread can not receive another message until it
replies to the sender. In short, a thread can receive only one
message at once. Once the thread receives message, it can send
another message to different object. This mechanism allows threads
to redirect the sender's request to another thread.
</p>
<p>
A thread can receive a message from the specific object which is
created by itself or thread in same task.
If the message has not arrived, it
blocks until any message comes in. The following figure shows the IPC transmit
sequence of Prex.
</p>

<img alt="ipc queue" src="img/msg.gif" border="1"
style="width: 505px; height: 347px;"><br>

<i><b>Figure 6. IPC Transmit Sequence</b></i>

<h3>Message Transfer</h3>
<p>
The message is copied to task to task directly without kernel
buffering. The memory region of sent message is
automatically mapped to the receiver's memory within kernel.
This mechanism allows to reduce the number of copy time while message
transfer.
Since there is no page out of memory in Prex, we can
copy the message data via physical memory at anytime.
</p>
<img alt="Message transfer" src="img/ipcmap.gif" border="1"
style="width: 459px; height: 321px;"><br>

<i><b>Figure 7. IPC message transfer</b></i>

<h2 id="except">Exception Handling</h2>
<p>
A user mode task can specify its own exception handler with exception_setup().
There are two different types of exception.
</p>
<ul>
  <li><b>H/W exception</b>:
  This type of exception is caused by H/W trap &amp fault. The exception
  will be sent to the thread which caused the trap.
  If no exception handler is specified by the task, it will be
  terminated by kernel.</li>

  <li><b>S/W exception</b>:
  The user mode task can send S/W exception to another task by exception_raise().
  The exception
  will be sent to the thread that is sleeping with exception_wait().
  If no thread is waiting for the exception, the exception is sent
  to the first thread in the target task.</li>
</ul>
<p>
Kernel supports 32 types of exception.
The following pre-defined exceptions are raised by kernel itself.
</p>

<i><b>Table 7. Kernel exceptions</b></i>
<table border="1" width="80%" cellspacing="0">
<tbody>
<tr>
  <th>Exception</th>
  <th>Type</th>
  <th>Reason</th>
</tr>
<tr>
  <td>SIGILL</td>
  <td align="center">H/W</td>
  <td>Illegal instruction</td>
</tr>
<tr>
  <td>SIGTRAP</td>
  <td align="center">H/W</td>
  <td>Break point</td>
</tr>
<tr>
  <td>SIGFPE</td>
  <td align="center">H/W</td>
  <td>Math error</td>
</tr>
<tr>
  <td>SIGSEGV</td>
  <td align="center">H/W</td>
  <td>Invalid memory access</td>
</tr>
<tr>
  <td>SIGALRM</td>
  <td align="center">S/W</td>
  <td>Alarm event</td>
</tr>
</tbody>
</table>

<p>
POSIX emulation library will setup its own exception handler to convert
the Prex exceptions into UNIX signals. It will maintain its own signal mask.
And, it transfer control to the actual POSIX signal handler that is
defined by the user mode process.

</p>

<h2 id="int"> Interrupt Framework</h2>
<p>
Prex defines two different types of interrupt service to
optimize the response time of real-time operation.
</p>

<h3>Interrupt Service Routine (ISR)</h3>
<p>
ISR is started by an actual hardware interrupt. The associated
interrupt is disabled in ICU and CPU interrupt is enabled
while it runs. If ISR determines that its device generates
the interrupt, ISR must program the device to stop the interrupt.
Then, ISR should do minimum I/O operation and return control
as quickly as possible.
ISR will run within the context of current running thread at
interrupt time. So, only few kernel services are available within
ISR. ASSERT() macro can be used to detect the invalid
function call from ISR.
</p>

<h3>Interrupt Service Thread (IST)</h3>
<p>
IST is automatically activated if ISR returns INT_CONTINUE to kernel. It
will be called when the system enters safer condition than ISR.
Any interrupt driven I/O operation should be done in IST not ISR.
Since ISR for same IRQ may be run during IST, the shared data,
resources, and device registers must be synchronized by using
irq_lock().
IST does not have to be reentrant, since it is not interrupted
by same IST itself.
</p>

<h3>Interrupt Nesting &amp Priority</h3>
<p>
Each ISR has its logical priority level, with 0 being the lowest
priority. While one ISR is running, all lower priority interrupts
are masked off.
This interrupt nesting mechanism avoids delaying of
high priority interrupt events.
</p>
<p>
IST is executed as a normal thread dispatched by the scheduler. So,
the interrupt thread which has higher priority is executed first.
The driver writer can specify the thread priority of IST when
IST is attached to the specific interrupt line.
The important point is that even a user mode task can be performed
prior to an interrupt thread.
</p>
<p>
The following figure is the sample of the Prex interrupt processing.
</p>
<p>
<img alt="Interrupt Processing" src="img/irq.gif" border="1"><br>
<i><b>Figure 8. Prex Interrupt Processing</b></i>
</p>
<p>
</p>

<h3>Interrupt Locking</h3>
<p>
irq_lock() &amp irq_unlock() are used to disable all interrupts in
order to synchronize the access to the kernel or H/W resource. Since irq_lock()
increments a lock counter, irq_unlock() will automatically restore
to the original interrupt state when locking count becomes 0.
So the caller does not have to save the previous interrupt state.
</p>

<h3>Interrupt Stack</h3>
<p>
If each ISR uses the kernel stack of the current running thread, the
stack area may be over-flow when continuous interrupts are occurred at
one same thread. So the kernel stack will be switched to the dedicated stack
while ISR is running.
</p>


<h2 id="timer">Timer</h2>

<h3>Kernel Timers</h3>

<p>
The kernel timer provides the following feature.
</p>
<ul>
  <li><b>Sleep timer</b>:      Put caller thread to sleep in the specified time.</li>
  <li><b>Call back timer</b>:  Call the routine after specified time passes.</li>
  <li><b>Periodic timer</b>:   Call the routine at the specified interval.</li>
</ul>

<h3>Timer Jitter</h3>

<p>
The periodic timer is designed to minimize the deviation between desired and
actual expiration.
</p>

<h2 id="device">Device I/O Service</h2>

The Prex device driver module is separated from the kernel, and this module
is linked with the kernel at the boot time.
The kernel provides only simple and minimum services to help the 
communication between applications and drivers.

<h3>Device Object</h3>
<p>
Since the Prex kernel does not have the file system in it, the kernel
provides a device object service for I/O interface.
The device object is created by the device driver to communicate to the
application. Usually, the driver creates a device object for an existing
physical device. But, it can be used to handle logical or virtual devices.
</p>

<h3>Driver Interface</h3>
<p>
The interface between kernel and drivers are defined clearly as
"Driver Kernel Interface". The kernel provides the following services
for device drivers.
</p>
<ul>
  <li>Device object service</li>
  <li>Kernel memory allocation</li>
  <li>Physical page allocation</li>
  <li>Interrupt handling service</li>
  <li>Scheduler service</li>
  <li>Timer service</li>
  <li>Debug service</li>
</ul>

<h3>Application Interface</h3>
<p>
The kernel device I/O interfaces are provided to access the specific device
object which is handled by a driver. 
The Prex kernel provides the following 5 functions for applications.
</p>
<ul>
 <li>Open a device</li>
 <li>Close a device</li>
 <li>Read from a device</li>
 <li>Write to a device</li>
 <li>Device I/O control</li>
</ul>

<h2 id="mutex">Mutex</h2>
<h3>Priority Inheritance</h3>
<p>
The thread priority is automatically changed at one of the following conditions.
</p>
<ol>
  <li>
  When the current thread fails to lock the mutex and the mutex
  owner has lower priority than current thread, the priority
  of mutex owner is boosted to the current priority.
  If this mutex owner is waiting for another mutex, such related
  mutexes are also processed.
  </li>
  <li>
  When the current thread unlocks the mutex and its priority
  has already been boosted, kernel recomputes the current priority.
  In this case, the priority is set to the highest
  priority among the threads waiting for the mutexes locked by the
  current thread.
  </li>
  <li>
   When the priority is changed by the user request, the related (inherited)
   thread's priority is also changed.
  </li>
</ol>
<p>
There are following limitations about priority inheritance
with Prex mutex.
</p>
<ol>
  <li>
  If the priority is changed by the user request, the priority
  recomputation is done only when the new priority is higher
  than old priority. The inherited priority is reset to base
  priority when the mutex is unlocked.
  </li>
  <li>
  Even if thread is killed with mutex waiting, the related
  priority is not adjusted.
  </li>
</ol>

<h2 id="debug">Debug</h2>
There are following debugging support functions:
<ul>
  <li>printf(): Display the debug message in kernel.</li>
  <li>panic(): Dump processor registers and stop system.</li>
  <li>ASSERT(): If expression is false (zero), stop system and display information.</li>
</ul>

<p>
The output routine for printf() is initially set to the function
in the architecture dependent layer. However, the device driver can
override this routine by using debug_attch() service.
</p>

<h2 id="arch">Architecture Dependent Layer</h2>

<p>
The interface to the architecture dependent layer is strictly defined
by the Prex kernel. This interface is designed carefully to support various
different architectures with minimum code changes.
So, it is easy to port the Prex kernel to different architecture.
</p>

<p>
The following functions must be provided by the architecture dependent layer.
</p>

<i><b>Table 8. H/W Interface</b></i>
<table border="1" width="70%" cellspacing="0">
<tbody>

<tr>
  <th width="80">Class</th>
  <th width="100">Methods</th>
  <th>Description</th>
</tr>

<tr>
  <td rowspan="5">context</td>
  <td>init</td>
  <td>Initialize specified context</td>
</tr>
<tr>
  <td>set</td>
  <td>Set specific register value</td>
</tr>
<tr>
  <td>switch</td>
  <td>Switch context</td>
</tr>
<tr>
  <td>save</td>
  <td>Save user mode context</td>
</tr>
<tr>
  <td>restore</td>
  <td>Restore user mode context</td>
</tr>

<tr>
  <td rowspan="7">interrupt</td>
  <td>enable</td>
  <td>Enable all interrupts</td>
</tr>
<tr>
  <td>disable</td>
  <td>Disable all interrupts</td>
</tr>
<tr>
  <td>save</td>
  <td>Save current interrupt state</td>
</tr>
<tr>
  <td>restore</td>
  <td>Restore interrupt state</td>
</tr>
<tr>
  <td>mask</td>
  <td>Mask specific interrupt</td>
</tr>
<tr>
  <td>unmask</td>
  <td>Unmask specific interrupt</td>
</tr>
<tr>
  <td>setup</td>
  <td>Program interrupt mode (edge/level)</td>
</tr>

<tr>
  <td rowspan="6">mmu (*)</td>
  <td>init</td>
  <td>Initialize MMU</td>
</tr>
<tr>
  <td>map</td>
  <td>Map physical memory</td>
</tr>
<tr>
  <td>newmap</td>
  <td>Creates new page mapping</td>
</tr>
<tr>
  <td>delmap</td>
  <td>Delete all page mapping</td>
</tr>
<tr>
  <td>switch</td>
  <td>Switch to new page directory</td>
</tr>
<tr>
  <td>extract</td>
  <td>Return physical address</td>
</tr>

<tr>
  <td>clock</td>
  <td>init</td>
  <td>Initialize clock timer device</td>
</tr>


<tr>
  <td rowspan="3">umem</td>
  <td>copyin</td>
  <td>Copy data to kernel area</td>
</tr>
<tr>
  <td>copyout</td>
  <td>Copy data to user area</td>
</tr>
<tr>
  <td>strnlen</td>
  <td>Get the string length in user area</td>
</tr>


<tr>
  <td rowspan="2">diag</td>
  <td>init</td>
  <td>Initialize diagnostic port</td>
</tr>
<tr>
  <td>print</td>
  <td>Put diag message</td>
</tr>

<tr>
  <td rowspan="3">machine</td>
  <td>init</td>
  <td>Initialize basic h/w</td>
</tr>
<tr>
  <td>idle</td>
  <td>Set system to idle state</td>
</tr>
<tr>
  <td>reset</td>
  <td>Reset system</td>
</tr>

</tbody>
</table>

<p>
<i>*) In case of no-MMU system, MMU related routines will be
defined as no-operation routine.
So, the kernel common layer can not assume MMU is always available.</i>
</p>

    </tr>
    <tr>
      <td id="footer" colspan="2" style="vertical-align: top;">
        <a href="http://sourceforge.net">
        <img src="http://sourceforge.net/sflogo.php?group_id=132028&amp;type=1"
        alt="SourceForge.net Logo" border="0" height="31" width="88"></a><br>
        Copyright&copy; 2005-2007 Kohsuke Ohtani
      </td>
    </tr>

  </tbody>
</table>

</div>
<div id="bottom"></div>

</body>
</html>
